Noise and Error
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 7: The VC Dimension
learning happens if finite $d_{VC}$, large N, and low $E_{in}$

Lecture 8: Noise and Error
• Noise and Probabilistic Target
• Error Measure
• Algorithmic Error Measure
• Weighted Classification

3 How Can Machines Learn?
4 How Can Machines Learn Better?
Noise and Error Noise and Probabilistic Target


Recap: The Learning Flow

[Figure: learning flow. An unknown target function (+ noise) and an unknown P on X generate training examples D: (x_1, y_1), ..., (x_N, y_N); the learning algorithm A, using hypothesis set H, outputs the final hypothesis g ≈ f.]

what if there is noise?




Noise

briefly introduced noise before pocket algorithm

age                  23 years
gender               female
annual salary        NTD 1,000,000
year in residence    1 year

credit? {no(−1), yes(+1)}

• noise in y: good customer, 'mislabeled' as bad?
• noise in y: same customers, different labels?
• noise in x: inaccurate customer information?

does VC bound work under noise?


Probabilistic Marbles

one key of VC bound: marbles!

'deterministic' marbles
• marble x ~ P(x)
• deterministic color $\llbracket f(x) \neq h(x) \rrbracket$

'probabilistic' (noisy) marbles
• marble x ~ P(x)
• probabilistic color $\llbracket y \neq h(x) \rrbracket$ with y ~ P(y|x)

same nature: can estimate P[orange] if the marbles are sampled i.i.d.

VC holds for x ~ P(x) i.i.d. and y ~ P(y|x) i.i.d., that is, (x, y) ~ P(x, y) i.i.d.
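The 'noisy marbles' view can be checked with a small simulation. This is only a sketch under made-up assumptions: P(x) is uniform on [0, 1), the ideal mini-target, the fixed hypothesis h, and the flipping noise level 0.3 are all hypothetical choices for illustration.

```python
import random

def target(x):
    """Hypothetical ideal mini-target: +1 on the right half of [0, 1)."""
    return 1 if x >= 0.5 else -1

def hypothesis(x):
    """Hypothetical fixed hypothesis h with a slightly misplaced threshold."""
    return 1 if x >= 0.4 else -1

def sample_example(rng, noise_level=0.3):
    """Draw x ~ P(x) (uniform here), then y ~ P(y|x) by flipping f(x) with prob. noise_level."""
    x = rng.random()
    y = target(x)
    if rng.random() < noise_level:
        y = -y
    return x, y

def in_sample_error(examples):
    """Fraction of 'orange' marbles: examples with y != h(x)."""
    return sum(1 for x, y in examples if y != hypothesis(x)) / len(examples)

rng = random.Random(0)
examples = [sample_example(rng) for _ in range(20000)]
e_in = in_sample_error(examples)

# Exact out-of-sample error under these assumptions:
# h disagrees with f on [0.4, 0.5) (mass 0.1), where y != h(x) with prob. 0.7;
# elsewhere (mass 0.9) y != h(x) only when the label was flipped, prob. 0.3.
e_out = 0.1 * 0.7 + 0.9 * 0.3
print(f"E_in = {e_in:.3f}, E_out = {e_out:.3f}")
```

With i.i.d. samples of (x, y), the orange fraction in the sample stays close to the true orange probability even though every label is noisy.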


Target Distribution P(y|x)
characterizes behavior of 'mini-target' on one x

• can be viewed as 'ideal mini-target' + noise, e.g.
  • P(∘|x) = 0.7, P(×|x) = 0.3
  • ideal mini-target f(x) = ∘
  • 'flipping' noise level = 0.3
• deterministic target f: special case of target distribution
  • P(y|x) = 1 for y = f(x)
  • P(y|x) = 0 for y ≠ f(x)

goal of learning:
predict ideal mini-target (w.r.t. P(y|x)) on often-seen inputs (w.r.t. P(x))
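A minimal sketch of the 'ideal mini-target + noise' reading, assuming the most probable label is taken as the ideal mini-target (as in the example above). The function name `decompose` and the labels "o" and "x" are hypothetical.

```python
def decompose(p_y_given_x):
    """Split P(y|x) at one x into (ideal mini-target, flipping noise level).

    Assumption: the ideal mini-target is the most probable label.
    """
    ideal = max(p_y_given_x, key=p_y_given_x.get)
    noise_level = 1.0 - p_y_given_x[ideal]
    return ideal, noise_level

noisy = {"o": 0.7, "x": 0.3}          # the example from the slide
deterministic = {"o": 1.0, "x": 0.0}  # deterministic target as a special case

print(decompose(noisy))
print(decompose(deterministic))
```

The deterministic target comes out as the special case with noise level 0.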







The New Learning Flow

[Figure: learning flow with noise. An unknown target distribution P(y|x) containing f(x) + noise and an unknown P on X generate training examples D: (x_1, y_1), ..., (x_N, y_N); the learning algorithm A, using hypothesis set H, outputs the final hypothesis g ≈ f.]

VC still works, pocket algorithm explained :-)




Fun Time

Let's revisit PLA/pocket. Which of the following claims is true?

(1) In practice, we should try to compute if D is linearly separable before deciding to use PLA.
(2) If we know that D is not linearly separable, then the target function f must not be a linear function.
(3) If we know that D is linearly separable, then the target function f must be a linear function.
(4) None of the above

Reference Answer: (4)

(1) After computing if D is linearly separable, we shall know w* and then there is no need to use PLA. (2) What about noise? (3) What about 'sampling luck'? :-)
Noise and Error Error Measure


Error Measure

final hypothesis g ≈ f

• how well? previously, considered out-of-sample measure
  $E_{out}(g) = \mathbb{E}_{x \sim P} \llbracket g(x) \neq f(x) \rrbracket$
• more generally, error measure E(g, f)
• naturally considered
  • out-of-sample: averaged over unknown x
  • pointwise: evaluated on one x
  • classification: $\llbracket \text{prediction} \neq \text{target} \rrbracket$

classification error $\llbracket \ldots \rrbracket$:
often also called '0/1 error'


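Pointwise Error Measure
can often express E(g, f) = averaged err(g(x), f(x)), like

$E_{out}(g) = \mathbb{E}_{x \sim P} \underbrace{\llbracket g(x) \neq f(x) \rrbracket}_{\text{err}(g(x), f(x))}$

err: called pointwise error measure

in-sample
$E_{in}(g) = \frac{1}{N} \sum_{n=1}^{N} \text{err}(g(x_n), f(x_n))$

out-of-sample
$E_{out}(g) = \mathbb{E}_{x \sim P} \, \text{err}(g(x), f(x))$

will mainly consider pointwise err for simplicity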


Two Important Pointwise Error Measures

0/1 error
$\text{err}(\tilde{y}, y) = \llbracket \tilde{y} \neq y \rrbracket$
• correct or incorrect?
• often for classification

squared error
$\text{err}(\tilde{y}, y) = (\tilde{y} - y)^2$
• how far is ỹ from y?
• often for regression

how does err 'guide' learning?
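The two pointwise measures, and the in-sample average built from them, can be sketched as follows. The toy data sets and the hypotheses `g_sign` and `g_line` are hypothetical, chosen only so the numbers are easy to check by hand.

```python
def err_01(y_pred, y):
    """0/1 error: 1 if the prediction is incorrect, 0 otherwise."""
    return 1 if y_pred != y else 0

def err_squared(y_pred, y):
    """Squared error: how far the prediction is from y."""
    return (y_pred - y) ** 2

def e_in(g, examples, err):
    """In-sample error: average of the pointwise err over the N examples."""
    return sum(err(g(x), y) for x, y in examples) / len(examples)

# Classification toy data: the last label disagrees with sign(x).
cls_data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, -1)]
g_sign = lambda x: 1 if x > 0 else -1

# Regression toy data: g_line is off by 1 on the last point.
reg_data = [(0.0, 0.0), (1.0, 2.0), (2.0, 5.0)]
g_line = lambda x: 2.0 * x

print(e_in(g_sign, cls_data, err_01))       # one mistake out of four
print(e_in(g_line, reg_data, err_squared))  # squared errors 0, 0, 1 averaged
```

Passing err as a parameter mirrors the slide's point that the same averaging applies to any pointwise measure.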







Ideal Mini-Target

interplay between noise and error:
P(y|x) and err define ideal mini-target f(x)

P(y = 1|x) = 0.2, P(y = 2|x) = 0.7, P(y = 3|x) = 0.1

ỹ      avg. err, 0/1 error          avg. err, squared error
       err(ỹ, y) = [ỹ ≠ y]          err(ỹ, y) = (ỹ − y)^2
1      0.8                          1.1
2      0.3 (*)                      0.3
3      0.9                          1.5
1.9    1.0 (really? :-))            0.29 (*)

0/1 error: $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} P(y|x)$
squared error: $f(x) = \sum_{y \in \mathcal{Y}} y \cdot P(y|x)$
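The average errors in the table can be recomputed directly from P(y|x). A short sketch, where the helper name `avg_err` is an assumption for illustration:

```python
# P(y|x) from the slide, at one fixed x.
P = {1: 0.2, 2: 0.7, 3: 0.1}

def avg_err(y_pred, dist, err):
    """Expected pointwise error of predicting y_pred when y ~ dist."""
    return sum(p * err(y_pred, y) for y, p in dist.items())

err_01 = lambda y_pred, y: float(y_pred != y)
err_squared = lambda y_pred, y: (y_pred - y) ** 2

candidates = [1, 2, 3, 1.9]
table = {c: (avg_err(c, P, err_01), avg_err(c, P, err_squared)) for c in candidates}

# Ideal mini-targets: argmax for 0/1 error, weighted mean for squared error.
f_01 = max(P, key=P.get)
f_squared = sum(y * p for y, p in P.items())

for c, (e01, esq) in table.items():
    print(f"prediction {c}: avg 0/1 err {e01:.2f}, avg squared err {esq:.2f}")
print("ideal mini-target (0/1):", f_01)
print("ideal mini-target (squared):", round(f_squared, 2))
```

Among these candidates, the argmax gives the lowest average 0/1 error and the weighted mean gives the lowest average squared error, matching the starred entries.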


Learning Flow with Error Measure

[Figure: learning flow with error measure. An unknown target distribution P(y|x) containing f(x) + noise and an unknown P on X generate training examples D: (x_1, y_1), ..., (x_N, y_N); the learning algorithm A, using hypothesis set H and error measure err, outputs the final hypothesis g ≈ f.]

extended VC theory/'philosophy'
works for most H and err




Fun Time

Consider the following P(y|x) and err(ỹ, y) = |ỹ − y|. Which of the following is the ideal mini-target f(x)?

P(y = 1|x) = 0.10, P(y = 2|x) = 0.35,
P(y = 3|x) = 0.15, P(y = 4|x) = 0.40.

(1) 2.5 = average within Y = {1, 2, 3, 4}
(2) 2.85 = weighted mean from P(y|x)
(3) 3 = weighted median from P(y|x)
(4) 4 = argmax P(y|x)

Reference Answer: (3)

For the 'absolute error', the weighted median provably results in the minimum average err.
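The four quiz candidates can be compared numerically. A sketch with hypothetical helper names `weighted_median` and `avg_abs_err`:

```python
P = {1: 0.10, 2: 0.35, 3: 0.15, 4: 0.40}

def weighted_median(dist):
    """Smallest y whose cumulative probability reaches 0.5."""
    cumulative = 0.0
    for y in sorted(dist):
        cumulative += dist[y]
        if cumulative >= 0.5:
            return y
    raise ValueError("probabilities must sum to at least 0.5")

def avg_abs_err(y_pred, dist):
    """Expected absolute error of predicting y_pred when y ~ dist."""
    return sum(p * abs(y_pred - y) for y, p in dist.items())

candidates = {
    "average": sum(P) / len(P),
    "weighted mean": sum(y * p for y, p in P.items()),
    "weighted median": weighted_median(P),
    "argmax": max(P, key=P.get),
}
errors = {name: avg_abs_err(value, P) for name, value in candidates.items()}
best = min(errors, key=errors.get)

for name, value in candidates.items():
    print(f"{name}: prediction {value:.2f}, avg abs err {errors[name]:.3f}")
print("lowest average absolute error:", best)
```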
Noise and Error Algorithmic Error Measure


Choice of Error Measure

Fingerprint Verification
f: fingerprint → +1 (you) or −1 (intruder)

two types of error: false accept and false reject

            g = +1          g = −1
f = +1      no error        false reject
f = −1      false accept    no error

0/1 error penalizes both types equally




Fingerprint Verification for Supermarket

Fingerprint Verification
f: fingerprint → +1 (you) or −1 (intruder)

two types of error: false accept and false reject

            g = +1          g = −1
f = +1      no error        false reject
f = −1      false accept    no error

cost matrix
            g = +1    g = −1
f = +1      0         10
f = −1      1         0

• supermarket: fingerprint for discount
• false reject: very unhappy customer, lose future business
• false accept: give away a minor discount, intruder left fingerprint :-)







Fingerprint Verification for CIA

Fingerprint Verification
f: fingerprint → +1 (you) or −1 (intruder)

two types of error: false accept and false reject

            g = +1          g = −1
f = +1      no error        false reject
f = −1      false accept    no error

cost matrix
            g = +1    g = −1
f = +1      0         1
f = −1      1000      0

• CIA: fingerprint for entrance
• false accept: very serious consequences!
• false reject: unhappy employee, but so what? :-)
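A sketch of how the same set of mistakes is charged differently by each application. The CIA weights (1000 for a false accept, 1 for a false reject) follow the matrix above; the supermarket weights (10 for a false reject, 1 for a false accept) are as read from the supermarket slide. The dictionary layout and the name `total_cost` are assumptions.

```python
# Cost matrices indexed as cost[(f, g)]: f is the true label, g the prediction.
ZERO_ONE = {(+1, +1): 0, (+1, -1): 1, (-1, +1): 1, (-1, -1): 0}
SUPERMARKET = {(+1, +1): 0, (+1, -1): 10, (-1, +1): 1, (-1, -1): 0}
CIA = {(+1, +1): 0, (+1, -1): 1, (-1, +1): 1000, (-1, -1): 0}

def total_cost(pairs, cost):
    """Sum the pointwise costs over (true label, prediction) pairs."""
    return sum(cost[(f, g)] for f, g in pairs)

# One of each outcome: correct accept, false reject, false accept, correct reject.
pairs = [(+1, +1), (+1, -1), (-1, +1), (-1, -1)]

print("0/1:", total_cost(pairs, ZERO_ONE))
print("supermarket:", total_cost(pairs, SUPERMARKET))
print("CIA:", total_cost(pairs, CIA))
```

The predictions are identical in all three cases; only the error measure changes what counts as a bad classifier.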







Take-home Message for Now
err is application/user-dependent

Algorithmic Error Measures $\widehat{\text{err}}$
• true: just err
• plausible:
  • 0/1: minimum 'flipping noise' (NP-hard to optimize, remember? :-))
  • squared: minimum Gaussian noise
• friendly: easy to optimize for A
  • closed-form solution
  • convex objective function

$\widehat{\text{err}}$: more in next lectures




Learning Flow with Algorithmic Error Measure

[Figure: learning flow with algorithmic error measure. An unknown target distribution P(y|x) containing f(x) + noise and an unknown P on X generate training examples D: (x_1, y_1), ..., (x_N, y_N); the learning algorithm A, using hypothesis set H and $\widehat{\text{err}}$, outputs the final hypothesis g ≈ f, which is evaluated with the error measure err.]

err: application goal;
$\widehat{\text{err}}$: a key part of many A




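Fun Time
Consider err below for CIA. What is $E_{in}(g)$ when using this err?

            g = +1    g = −1
f = +1      0         1
f = −1      1000      0

(1) $\frac{1}{N} \sum_{n=1}^{N} \llbracket y_n \neq g(x_n) \rrbracket$
(2) $\frac{1}{N} \left( \sum_{y_n = +1} \llbracket y_n \neq g(x_n) \rrbracket + 1000 \sum_{y_n = -1} \llbracket y_n \neq g(x_n) \rrbracket \right)$
(3) $\frac{1}{N} \left( \sum_{y_n = +1} \llbracket y_n \neq g(x_n) \rrbracket - 1000 \sum_{y_n = -1} \llbracket y_n \neq g(x_n) \rrbracket \right)$
(4) $\frac{1}{N} \left( 1000 \sum_{y_n = +1} \llbracket y_n \neq g(x_n) \rrbracket + \sum_{y_n = -1} \llbracket y_n \neq g(x_n) \rrbracket \right)$

Reference Answer: (2)

When $y_n = -1$, the false positive made on such $(x_n, y_n)$ is penalized 1000 times more!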
Noise and Error Weighted Classification


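Weighted Classification
CIA Cost (Error, Loss, ...) Matrix

            h(x) = +1    h(x) = −1
y = +1      0            1
y = −1      1000         0

out-of-sample
$E_{out}(h) = \mathbb{E}_{(x, y) \sim P} \left\{ \begin{array}{ll} 1 & \text{if } y = +1 \\ 1000 & \text{if } y = -1 \end{array} \right\} \cdot \llbracket y \neq h(x) \rrbracket$

in-sample
$E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} \left\{ \begin{array}{ll} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{array} \right\} \cdot \llbracket y_n \neq h(x_n) \rrbracket$

weighted classification:
different 'weight' for different (x, y)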


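Minimizing $E_{in}$ for Weighted Classification

$E_{in}^{w}(h) = \frac{1}{N} \sum_{n=1}^{N} \left\{ \begin{array}{ll} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{array} \right\} \cdot \llbracket y_n \neq h(x_n) \rrbracket$

Naive Thoughts
• PLA: doesn't matter if linearly separable. :-)
• pocket: modify pocket-replacement rule
  if $w_{t+1}$ reaches smaller $E_{in}^{w}$ than $\hat{w}$, replace $\hat{w}$ by $w_{t+1}$

pocket: some guarantee on $E_{in}^{0/1}$;
modified pocket: similar guarantee on $E_{in}^{w}$?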







Systematic Route: Connect $E_{in}^{w}$ and $E_{in}^{0/1}$

original problem

            h(x) = +1    h(x) = −1
y = +1      0            1
y = −1      1000         0

D: (x_1, +1)
   (x_2, −1)
   (x_3, −1)
   ...
   (x_{N−1}, +1)
   (x_N, +1)

equivalent problem

            h(x) = +1    h(x) = −1
y = +1      0            1
y = −1      1            0

D: (x_1, +1)
   (x_2, −1), (x_2, −1), ..., (x_2, −1)
   (x_3, −1), (x_3, −1), ..., (x_3, −1)
   ...
   (x_{N−1}, +1)
   (x_N, +1)

after copying −1 examples 1000 times,
$E_{in}^{w}$ for LHS ≡ $E_{in}^{0/1}$ for RHS!
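The copying argument can be verified on a toy data set. This is a sketch with hypothetical data and a hypothetical fixed h. The two averages differ only in their normalization (N versus the size of the copied set), so the code compares unnormalized error sums.

```python
def weight(y, neg_weight=1000):
    """Cost of a mistake on an example with label y (CIA matrix)."""
    return neg_weight if y == -1 else 1

def weighted_error_sum(h, data, neg_weight=1000):
    """Unnormalized weighted error on the original data."""
    return sum(weight(y, neg_weight) for x, y in data if h(x) != y)

def virtual_copy(data, neg_weight=1000):
    """Explicitly copy each -1 example neg_weight times; +1 examples once."""
    copied = []
    for x, y in data:
        copied.extend([(x, y)] * weight(y, neg_weight))
    return copied

def zero_one_error_sum(h, data):
    """Unnormalized 0/1 error: the number of mistakes."""
    return sum(1 for x, y in data if h(x) != y)

data = [(0.5, +1), (-1.0, -1), (2.0, -1), (3.0, +1), (-0.2, +1)]
h = lambda x: 1 if x > 0 else -1  # wrong on (2.0, -1) and (-0.2, +1)

copied = virtual_copy(data)
print(weighted_error_sum(h, data), zero_one_error_sum(h, copied))
```

In practice the copies never need to be materialized, which is why the slides call it 'virtual copying'.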




Weighted Pocket Algorithm

using 'virtual copying', the weighted pocket algorithm includes:
• weighted PLA:
  randomly check −1 example mistakes with 1000 times more probability
• weighted pocket replacement:
  if $w_{t+1}$ reaches smaller $E_{in}^{w}$ than $\hat{w}$, replace $\hat{w}$ by $w_{t+1}$

systematic route (called 'reduction'):
can be applied to many other algorithms!
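The two modifications can be sketched in plain Python. This is not the lecture's reference implementation: the function names, the iteration budget, the zero initialization, and the toy data (with x_0 = 1 as the bias coordinate) are all assumptions made for illustration.

```python
import random

def predict(w, x):
    """Linear classifier sign(w . x), with sign(0) taken as -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def weighted_e_in(w, data, neg_weight):
    """Weighted in-sample error: mistakes on -1 examples cost neg_weight."""
    total = sum((neg_weight if y == -1 else 1)
                for x, y in data if predict(w, x) != y)
    return total / len(data)

def weighted_pocket(data, neg_weight=1000, iterations=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    pocket, pocket_err = list(w), weighted_e_in(w, data, neg_weight)
    for _ in range(iterations):
        mistakes = [(x, y) for x, y in data if predict(w, x) != y]
        if not mistakes:
            break
        # Weighted PLA: a -1 mistake is neg_weight times more likely to be
        # picked, as if it had been copied neg_weight times.
        weights = [neg_weight if y == -1 else 1 for _, y in mistakes]
        x, y = rng.choices(mistakes, weights=weights, k=1)[0]
        w = [wi + y * xi for wi, xi in zip(w, x)]
        # Weighted pocket replacement: keep the best weights seen so far.
        err = weighted_e_in(w, data, neg_weight)
        if err < pocket_err:
            pocket, pocket_err = list(w), err
    return pocket, pocket_err

data = [((1.0, 2.0, 1.0), +1), ((1.0, 1.5, 2.0), +1), ((1.0, 0.1, 0.2), +1),
        ((1.0, -1.0, -0.5), -1), ((1.0, -2.0, 0.3), -1), ((1.0, 0.2, 0.1), -1)]
pocket, pocket_err = weighted_pocket(data)
print(pocket, pocket_err)
```

Recomputing the weighted error over all of D at every update keeps the sketch short but is the costly part, as with the ordinary pocket algorithm.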


Fun Time

Consider the CIA cost matrix. If there are 10 examples with $y_n = -1$ (intruder) and 999,990 examples with $y_n = +1$ (you), what would $E_{in}^{w}(h)$ be for a constant h(x) that always returns +1?

Reference Answer: (2)

While the quiz is a simple evaluation, it is not uncommon that the data is very unbalanced for such an application. Properly 'setting' the weights can be used to avoid the lazy constant prediction.
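The evaluation in the quiz is a one-line computation. A sketch, assuming the CIA weights of 1000 for a false accept and 1 for a false reject; the variable names are hypothetical.

```python
n_neg, n_pos = 10, 999_990   # intruder examples, 'you' examples
N = n_neg + n_pos

# Constant h(x) = +1 is correct on every +1 example and makes a false
# accept (weight 1000) on every -1 example.
e_in_weighted = (n_neg * 1000 + n_pos * 0) / N

# Under plain 0/1 error the same constant hypothesis looks nearly perfect.
e_in_01 = n_neg / N

print(e_in_weighted, e_in_01)
```

The weighted error is a thousand times larger than the 0/1 error here, which is what discourages the lazy constant prediction.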


Summary
1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 7: The VC Dimension

Lecture 8: Noise and Error
• Noise and Probabilistic Target
  can replace f(x) by P(y|x)
• Error Measure
  affects 'ideal' target
• Algorithmic Error Measure
  user-dependent ⇒ plausible or friendly
• Weighted Classification
  easily done by virtual 'example copying'

• next: more algorithms, please? :-)
3 How Can Machines Learn?
4 How Can Machines Learn Better?